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PARALLEL  SOLUTION  OP  LINEAR  SYSTEMS  WITH 


STRIPS)  SPARSE  MATRICES 
PART  1:  VLSI  net«#orlcs  for  striped  matrices 

'■St 

Rami  Melhem 

ABSTRACT 

The  multiplication  of  a  matrix  by  a  vector  and  the 
solution  of  triangular  linear  systems  are  the  most  demanding 
operations  In  the  majority  of  Iterative  techniques  for  the 
solution  of  linear  systems.  Data-driven  VLSI  networks  that 
perform  these  t%#o  operatlons»  efficiently,  for  sparse 
matrices  are  Introduced.  In  order  to  avoid  computations 
that  Involve  zero  operands,  the  non-zero  elements  In  a 
sparse  matrix  are  organized  In  the  form  of  non  Intersecting 
stripes,  and  only  the  elements  within  the  stripe  structure 
of  the  matrix  are  manipulated.  Detailed  analysis  of  the 
netviorks  proves  that  both  operations  may  be  completed  in  n 
global  cycles  with  minimal  comnunlcatlon  overhead,  %diere  n 
Is  the  order  of  the  linear  system.  The  number  of  cells  In 
each  network  as  well  as  the  conniunlcatlon  overhead  are 
determined  by  the  stripe  structure  of  the  matrix.  A  compsun- 
lon  paper  [12]  ~^gxamlnes  this  structure  for  the  class  of 
sparse  matrices  generated  In  finite  Element  Analysis.  ^ 

*)  This  %irork  Is,  In  part,  supported  under  ONR  contract 
N00014-85-K-0339. 

**)  On  leave  from  the  Dept,  of  Computer  Science,  Purdue 
Unlv.,  West  lafayette,  IN  47907. 


Iterative  solvers  for  large  sparse  linear  systems  of 
equations  are,  once  again,  becoming  more  and  more  competi¬ 
tive  with  direct  solvers  [6,7].  In  this  paper,  we  consider 
the  two  basic  operations  that  constitute  the  bulk  of  the 
work  in  most  Iterative  methods.  Namely,  the  multiplication 
of  a  matrix  by  a  vector,  and  the  solution  of  triauigular 
linear  systems.  The  computations  involved  in  these  two 
operations  are  quite  regular  and  thus,  naturally  amenable  to 
efficient  implementations  on  regular  VLSI  networks;  that  is 
systolic  [9]  euid  data-driven  (sometimes  called  self-timed) 
[10]  arrays. 


However,  large  systems  that  appear  in  practice  are  usu¬ 
ally  sparse,  and  hence,  seem  to  be  inefficient  for  solution 
on  regular  VLSI  networks.  More  specifically,  if  C%  of  the 
elements  in  the  coefficient  matrix  are  zeroes,  then  C%  of 
the  resources  in  the  network  are  wasted  in  trivial  opera¬ 
tions  that  involve  zero  operands,  and  the  corresponding  data 
communicat ions . 
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In  order  to  avoid  this  waste  and  to  retain  the  atdvan- 
tage  of  having  fast  specialized  cells  and  efficient  local 
communications,  it  is  suggested  l!h  [13]  to  consider  regular 
data-driven  networks  that  are  designed  for  operations  on 
dense  matrices,  and  then  to  add  to  each  cell  in  the  networks 
the  capability  of  recognizing  and  skipping  trivial  opera¬ 
tions.  A  detailed  study  of  the  performance  of  the  resulting  id  and/or 
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networks  shows  large  speed  ups  for  highly  sparsed  matrices. 

In  this  paper,  we  present  a  different  approach  for 
using  regular  data-driven  networks  in  sparse  matrix  manipu¬ 
lation.  It  is  also  bamed  on  performing  only  non-trivial 
operations,  but  is  primarily  aimed  at  reducing  the  number  of 
computational  cells  in  the  netvrork,  rather  than  increasing 
its  speed. 

In  order  to  be  more  specific,  consider,  for  example, 
the  multiplication  of  an  nxn  banded  matrix  A  by  a  vector. 
This  operation  may  be  performed  in  2n'»-b  cycles  on  a  data- 
driven  (or  systolic)  network  [9]  that  uses  b  cells,  where  b 
is  the  band  width  of  A.  If  A  is  sparse,  then  the  approach 
of  [13]  uses  the  same  number  of  cells  b,  but  takes  advantage 
of  the  sparsity  of  A  to  reduce  the  multiplication  time  con¬ 
siderably.  On  the  other  hand,  the  approach  presented  here 
takes  aidvantage  of  the  sparsity  of  A  to  reduce  the  number  of 
cells  to  a  number  ir,  ^Ich  is,  usually,  much  smaller  than  b. 
The  multiplication,  in  this  case,  is  completed  in  approxi¬ 
mately  n  cycles. 

The  reduction  of  the  number  of  computational  cells  is 
based  on  the  assumption  that  the  non-zero  elements  of  the 
matrix  are  located  in  a  few  stripes  which  are  almost  paral¬ 
lel  to  the  diagonal  of  the  matrix.  A  specific  cell  is  then 
assigned  to  perform  the  operations  associated  with  the  ele¬ 
ments  of  a  particular  stripe.  Clearly,  the  crucial  issue  is 
to  choose  a  stripe  structure  that  minimizes,  or  even 


eliminates r  data  conflict. 
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The  netirarlc  that  we  introduce  here  is  especially  suit¬ 
able  for  the  type  of  matrices  that  result  in  finite  element 
analysis.  More  specifically,  for  this  type  of  matrices,  the 
band-width  depends  on  the  order  of  the  matrix  (usually, 
b-0(  fT)  ),  While  the  number  of  stripes  n  is  bounded  by  a 
small  number  vdiich  depends  on  the  maximum  number  of  elements 
that  may  share  a  particular  node.  In  other  words,  unlike  in 
[13],  the  size  of  the  network  is  independent  of  the  size  of 
the  problem. 

At  this  poinc,  we  should  mention  that  other  different 
approaches  have  been  suggested  for  the  parallel  solution  of 
sparse  linear  systems.  These  include  the  application  of 
content-addressable  VLSI  networks  [18],  and  data  flow  archi¬ 
tectures  [15,16]  to  the  minimization  of  conflicts  in  data 
access,  the  use  of  networks  with  interconnections  that 
reflect  the  underlying  graph  structure  of  the  matrix  [1,2], 
and  the  use  of  multiprocessors  with  general  interconnections 
[3,4]. 


We  start  in  Section  2  by  defining  the  stripe  structure 
of  a  general  sparse  matrix.  In  Section  3,  we  describe  a  VLSI 
net%rark  that  utilizes  the  stripe  structure  in  the  parallel 
multiplication  of  a  matrix  by  a  vector.  Then,  we  introduce, 
in  Section  4,  the  property  of  non-overlapping  stripes  and  we 
show  that,  if  this  property  is  satisfied  in  the  input 
matrix,  then  the  multiplication  does  terminate  in  n  global 
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cycles ,  where  a  global  cycle  Includes  some  communication 


activities,  and  a  multiply/2u3d  operation.  In  Section  5,  we 


estimate  the  comnunication  overhead  in  each  global  cycle. 


and  finally,  in  Section  6,  we  modify  the  matrix/vector  mul¬ 


tiplication  network  and  obtain  a  network  for  the  solution  of 


triangulair  lineeu:  systems. 


2.  STRIPE 


SPARSE  matrices 


We  define  a  stripe  structure  of  a  sparse  matrix  to  be  a 
set  of  stripes  that  are  almost  parallel  to  the  diagonal  of 
the  matrix,  2uid  that  contains  all  its  non-zero  elements. 
More  specif ic2Llly,  given  an  nxn  matrix  A,  with  lower  and 
upper  b£uidwidthes  b^  and  b2r  respectively,  we  augment  the 
set  T  »  {(ifj);  l<i,j<n}  of  positions  of  A  with  the  two  tri¬ 
angles  »  {(irj);  i*l»  —  ,bj^,  j-i-bj^, . . .  ,0)  and  - 
{ ( i, j ) ;  i-n-b^+lf • • • »n,  j«n+l, . . . , i+b^) ,  and  assume  that 
a^  for  (l,j)eT^  U  T^.  This  expands  the  set  of  allowetble 
positions  of  A  to  include  the  band  ((irj); 
l^l^n,  i-bj^^j^i+b^) .  Now  we  may  define  the  following: 

Definition  1:  Let  I^-{1, . . . ,n} .  A  stripe  3  of  the  matrix  A 
is  a  set  of  positions  S  -  {(l,o(l))  ;  ie  Ig  I^) ,  where  a  is 
an  increaising  function;  that  is,  if  i  <  j  and 
(i,o(l)), (j,o(j))  €  S,  then  a(i)  <  o(j).  If  S  contains  one 
entry  for  each  row  of  A;  that  is  S  -  {(i,o(i))  ;  iel^),  then 
S  is  called  a  full  stripe. 

Definition  2:  TWO  stripes  «  {(i,a^(i))}  euid  S2  ~ 
{( 1,02(1))}  are  ordered  by  the  relation  <  S2  (S^  is  less 
than  S2)  if  for  any  i  in  the  domain  of  o^,  and  j  in  the 
domain  of  O2, 

i  a  j  implies  that  Oj^Ci)  <  ©2(3)  • 

Note  that  if  and  S2  are  full  stripes,  then  3^^  <  S2  if 


D«f  in  it  ion  2.:  A  Stripe  structure  of  a  matrix  A  is  a  sequence 


of  ir  stripes  S,  <...<  S  ,  such  that  a.  .‘■0  it 

X  if  1  r  J 

(i,j)/^  S^U...U  (see  Pig  1(a),  v^ere  *.*  auid  ’x*  indicate 
a  zero  and  a  non-zero  element,  respectively,  auid  each  ele¬ 
ment  included  in  a  stripe  is  enclosed  in  a  circle). 


(a)  with  5  stripes  (b)  with  5  full,  parallel,  stripes 
Pig  1  -  Examples  of  stripe  structures 

A  special  class  of  stripe  structures  is  the  class  of 
structures  with  parallel  stripes,  in  the  sense  that  each 
stripe  has  the  form  ((ifi+s^)  ;  Le  Ig  I^},  for  some  con¬ 
stant  s^.  Hatrices  that  can  be  covered  by  parallel  stripes 
occur  frequently  in  practice,  especially  in  the  solution  of 
partial  differential  equations  on  rectangular  grids.  Por 
example,  if  the  nodes  of  the  grid  are  numbered  regularly, 
and  a  five  point  star  approximation  is  used  to  discretize 
the  differential  equation,  then  the  resulting  coefficient 
matrix  may  be  covered  by  five  parallel  full  stripes  (see  Pig 
1(b)).  Similarly,  if  finite  element  amalysis  is  used  with 
3,  4,  6  or  9  node  Lagrangian  elements,  then  the  resulting 


stiffness  matrix  may  be  covered  by  7,  9,  19  or  25  parallel 
full  stripes r  respectively.  Note  that  in  order  to  obtain 
full  stripes  in  the  exeuaples  of  Fig  1(b),  we  include  in  each 


stripe  few  positions  (i,j)  for  which  a.  .»0.  Note  also  that 

j 

if  the  finite  element  grid  is  not  recteuigular ,  which  is  usu¬ 
ally  the  case,  then  the  resulting  matrix  may  not  be  covered 
by  parallel  stripes. 


Matrices  with  parallel  full  stripes  may  be  efficiently 
stored  diagonal  by  diagonal.  This  diagonal  storage  scheme 
may  be  easily  extended  to  matrices  with  general  stripes. 
More  specifically,  given  a  matrix  A  with  it  stripes,  we  may 
store  the  elements  of  the  Ic  stripe  in  the  column  of  an 
nxjr  rectangular  array  E^,  such  that,  for  i»l,...,n  and 
l,...,Tr, 


'i,Oj^(i)  if  (i,aj^(i))€Sj^ 

E^(ifk)  "  \q  otherwise  (!•») 

In  addition  to  E^,  another  nxir  integer  eurray  is  needed  in 

order  to  store  the  values  of  In  other  words. 


-  1-1 


V‘> 

b. 


if  (i,Oj^(i))€Sj^ 

otherwise 


(l.b) 


Note  that  -bj^  is  not  a  valid  column  number  in  T  0  U  T2. 
Note  also  that  a  more  efficient  scheme  for  storing  the 
stripes  of  A  may  be  obtained  if  we  compact  every  column  k  of 
E^  and  by  storing  only  the  entries  corresponding  to  the 
elements  of  3^,  and  then  keep  the  compacted  columns  of  E^ 
and  P^  in  two  linear  arrays.  Of  course,  an  additional  array 


is  needed  in  this  case  in  order  to  keep  track  of  the  row 

number  of  each  element  in  E.. 

A 

Clearly,  many  stripe  structures  may  be  constructed  for 
a  given  sparse  matrix.  Among  the  different  structures  for 
any  banded  matrix  is  the  structure  obtained  by  considering  b 
parallel  stripes,  where  b  «  b^-t-b^+l  is  the  band-width.  How¬ 
ever,  in  order  to  take  full  advantage  of  the  above  stripe 
storage  scheme,  we  should  determine  the  stripe  structure  of 
the  matrix  that  minimizes  the  number  of  stripes.  An  algo¬ 
rithm  that  constructs  the  optimal  stripe  structure  for  any 
specific  sparse  matrix  is  given  in  the  Appendix. 

The  efficiency  of  the  stripe  storage  scheme  for  any  nxn 

matrix  A  striped  with  ir  stripes  is  given  by  the  ratio 
%diere  r^  is  the  number  of  non-zero  elements  in  A.  Although, 
this  ratio  may  be  low  for  general  sparse  matrices,  it  is 
shown  in  [12]  that  the  stripe  scheme  is  very  efficient  for 
the  type  of  matrices  resulting  from  the  discretization  of 
partial  differential  equations.  Moreover,  the  stripe  scheme 
has  an  important  advanta^je  over  other  sparse  schemes  [8], 
namely,  it  is  a  regular  scheme  that  may  be  explored  effi¬ 
ciently  in  parallel  processing. 


1. 


QZ.  A  STRIPED  MATRIX  fix  A  VECTOR 


1.1.  A  VLSI  data  dr iven  network 

The  systolic  network  given  in  [9]  for  the  multiplica¬ 
tion  of  a  beurided  matrix  A  with  a  vector  x  uses  b  cells  auid 
completes  the  computation  of  the  product  vector  y“Ax  in  2n+b 
cycles,  where  n  and  b  are  the  order  and  band-width,  respec¬ 
tively,  of  A.  In  this  section,  we  modify  this  network  such 
that  if  A  has  n  stripes,  ir  <<  b,  then  the  vector  y  may  be 
computed  using  a  network  of  only  n  computational  cells.  In 
order  to  be  consistent  with  our  future  notation,  we  denote 
the  stripes  of  A  by  S^,  k— tTj^,  . . .  ,112,  where  7rj^+7r2+l“7r. 


Fig  2  -  A  network  (MAT/VEC)  for  the  multiplication  of 
a  striped  matrix  by  a  vector 

In  Fig  2,  we  show  the  modified  network  that  we  call 
from  now  on  MAT/VEC.  Each  cell  in  MAT/VEC  has  five  input 
ports,  namely  I^,  r-l,..,5,  and  two  output  ports,  namely  0^^ 
and  O2.  The  elements  of  the  vector  x  are  fed  to  the  network 
from  port  I^  of  the  first  cell  and  the  elements  of  the 
result-vector  y,  initialized  to  zero,  are  fed  from  the  port 
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I 2  of  the  last  cell.  Successive  non-zero  elements  in  a  par¬ 
ticular  stripe  are  supplied  on  port  of  cell  Ic  in 

increaising  row  order.  Along  with  each  element  a^  ^  (i)^ 

supplied  on  of  cell  k,  the  values  of  i  and  o^(l)  are  sup¬ 
plied  on  the  input  porta  and  respectively.  Note  that 
the  row  index  supplied  on  may  be  eliminated  if  the 

stripes  are  full  or  if  the  elements  of  column  Ic  of 
defined  by  (1),  rather  than  the  elements  of  3^^,  are  supplied 
to  cell  k.  In  this  case  an  internal  counter  may  be  used  to 
keep  track  of  the  value  of  i. 

Each  communication  link  .  directed  from  cell  q  to 
cell  k  in  KAT/VEC  is  regarded  as  a  queue.  Only  cell  q  may 
vrrite  on  this  queue  and  only  cell  k  may  read  (and  delete) 
its  front  element.  For  simplicity,  we  will  assume,  for 
now,  that  the  queues  have  unlimited  caqpacity.  Then,  we  will 
derive  in  Section  5  the  actual  size  of  the  queues  needed  for 
proper  operation.  Note  that  in  practice,  any  communication 
link  is  just  a  connector,  auid  hence,  £Uiy  queues  associated 
with  the  link  should  be  physically  located  in  its  source  or 
destination  cells  (or  distributed  among  the  two). 


In  order  to  provide  the  flexibility  needed  to  deal  with 
sparse  structures,  we  assume  that  each  computational  cell 
contains  two  counters  CX  auid  CY  to  keep  track  of  the  indices 
of  the  data  received  on  I2  and  I^,  respectively.  We  also 
assume  that  the  network  is  data  driven,  that  is  the  cycle  of 
each  computational  cell  is  controlled  by  the  availad>ility  of 


the  input  data.  Finally,  in  order  to  study  the  effect  of 
internal  data  conflicts  on  the  operation  of  the  network,  we 
assume  that  external  data,  that  is  data  on  ports  I^,  and 
Ig,  as  well  as  port  of  cell  and  port  I2  cell 
are  always  availedsle  %«hen  needed.  With  this,  the  operation 
of  each  cell  k  may  be  described  by  the  following  cycle  i^ich 
is  executed  repeatedly  by  the  cell:  ([I^]  denotes  the  con¬ 
tent  of  I^,  and  «-  Rx  meauis  that  the  value  stored  in  the 
internal  register  Rx  is  written  on  port  O^). 

CYCLE  It  /*  Initially,  CX-CY-0  */ 

1)  Ra  -  [I3]  ;  Rj  -  [I^]  ;  Ri  -  [Ig] 

2)  Do  steps  2.1  and  2.2  in  parallel 

2.1)  wait  until  data  is  availadole  on 

R*  -  ;  CX  -  CX  +  1 

If  CX  <  Rj  Then  {  -  Rx  ?  Go  To  2.1  ) 

Else  JOIN  step  2.2. 

2.2)  wait  until  data  is  available  on  1^ 

Ry  -  [I^]  ?  CY  -  CY  +  1 

If  CY  <  Ri  Then  {  -  Ry  ;  Go  To  2.2  ) 

Else  JOIN  step  2.1. 

3) Ry-Ry+Ra*Rx 

4)  Oj^  -  Rx  ;  0^  *-  Ry. 

•• 

More  descriptively,  after  a  cell  k  receives  a.  _ 

irOj^(i) 

(step  1)  it  continues  to  transmit  the  components  of  x  from 
I^  to  Oj^  (step  2.1),  and  the  components  of  y  from  1 2  to  03 
(step  2.2),  until  it  finds  x  ...,  and  y . .  At  this  time. 


the  Inner  product  is  confuted  (step  3),  and  the  results  are 
written  out  (step  4).  The  JOIN  statements  in  2.1  and  2.2 
indicate  that  step  3  should  not  start  before  both  steps  2.1 
and  2.2  are  coiqpleted.  Note  that  the  parallel  execution  of 
steps  2.1  and  2.2  guarantees  that  if  auiy  of  the  x  or  y  data 
streams  are  blocked,  the  other  stream  may  continue  flowing. 
This  parallel  execution  may  be  simulated  by  a  busy  wait  loop 
that  polls  ports  and  I2  for  data.  More  specifically,  we 
may  replace  step  2  in  CYCLE  1  by: 

2)  While  (CX  <  Rj)  or  (CY  <  Ri)  Do 

2.1)  If  (CX  <  Rj)  and  (data  is  available  on  I^)  Then 

{  Rx  -  [Ij^]  ;  CX  -  CX  +  1  ; 

If  (CX  <  Rj)  Then  -  Rx  ) 

2.2)  If  (CY  <  Ri)  and  (data  is  available  on  I2)  Then 

{  Ry  -  CI2)  ;  CY  -  CY  +  1  ; 

If  (CY  <  Ri)  Then  O2  **  Ry  ) 


Next,  we  show  that  the  network  described  above  does 
compute  the  elements  y^,  i-l,...,n,  of  the  product  vector 
y-Ax  correctly  if  the  matrix  A  has  non- intersecting  stripes. 


2.2.  EtQOf.  qL 


The  following  properties  of  the  input  will  be  used: 


PI)  If  a..  ^  and  a„  ^  ,  are  the  inputs  to  port  I,  of 

u,o^(u)  V,Oj^(V)  *^3 

cell  k  at  two  consecutive  cycles,  t-1  and  t,  respec¬ 
tively,  then  V  >  u. 


m 


P2)  From  PI  and  Definition  1,  we  have  >  Oj^(u) . 

P3)  The  input  matrix  A  is  striped  according  to  Definition 

3;  that  is,  S^<S  if  k<q. 

K  q 

Let  us  first  assume  that  the  network  will  not  reach  a 
decui  state,  that  is  every  cycle  t  of  any  cell  k  will  ter¬ 
minate.  From  the  operation  of  each  cell  (CYCLE  1),  and  PI, 
it  is  clear  that  every  element  y^  that  is  read  by  cell  k 
(from  I^)  during  cycle  t  satisfies  u<i<v.  If  i-v,  then  the 
term  [a.  ^  x^  is  accumulated  in  y.  before  y.  is 

written  on  On  the  other  hand,  if  u<i<v,  then  y^  is 

copied  unmodified  to  02*  But  in  this  case,  PI  guarantees 
that  (i,o.  (i))  ^  S-  ,  because  otherwise  a.  _  ...  should  have 

K  K 

been  supplied  to  after  a^  ^  and  before 

Given  that  auiy  element  of  A  that  does  not  belong  to  some 
stripe  is  equal  to  zero,  we  conclude  that  y^  accumulates  all 


the  non  zero  terms  of 


•  a 

D  during  its  flow  from  cell 

3-1  J 


to  cell  *2* 


m 


m 


A  deadlock  configuration 
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In  order  to  complete  the  proof,  we  need  to  show  that 
the  network  does  not  reach  a  deadlock  state,  where  a  cell  k 
is  blocking  the  y  data  stream  and  another  cell  q,  q>k,  is 
blocking  the  x  stream  (see  Pig  3).  More  specifically,  a 
state  in  Which 

1)  cell  q  is  waiting  for  some  y^  that  is  locked  behind 

cell  k,  and 

2)  cell  k  is  waiting  for  some  x  .  .  that  is  locked 

k'  ^ 

behind  cell  q. 

Assume  that  this  deadlock  state  is  reached,  and  that  the 
data  appearing  on  ports  and  I 2  of  cells  q  and  k,  respec¬ 
tively,  are  x^  and  y^.  Hence, 


1  a  i 


The  fact  that  y^  is  not  copied  to  port  O2  of  cell  k  and  x^ 
is  not  copied  to  port  of  cell  q  implies  that  iav  and 
But  i>v  may  only  be  satisfied  if  the  previous 
input  on  I-  of  cell  k,  say  a„  ^  ,  satisfies  u-i-1,  \dilch 

means  that  uav  and  contradicts  PI.  Similarly,  we  may  show 
that  j>o  (i)  contradicts  P2.  Hence 


i  -  V  and  j  -  (4) 

From  (3)  and  (4),  we  get 

V  a  i  and  a|_(v)  a  o  (1) 

K  q 

which  contradicts  Definition  2  for  and  hence,  given 

that  k<q,  contradicts  property  P3  of  the  input. 
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2..1.  PaaudQ  gyatiQiic  gynchionlzat ion 


i 


<..1  ■ 


7- 

.V 


i 
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In  order  to  study  the  behavior  euid  estimate  the  execu¬ 
tion  time  of  the  netwrlc  MAT/VEC  ,  we  follow  the  technique 
suggested  in  [13]  for  the  study  of  self -timed  computations. 
Namely,  we  introduce  a  simpler,  hypothetical,  computation 
(called  a  pseudo  systolic  computation)  that  is  obtained  by 
forcing  some  synchronization  on  the  self-timed  conqputation. 
The  additional  synchronization  may  only  slow  down  execution, 
and  hence  the  execution  time  of  the  pseudo  systolic  computa¬ 
tion  forms  an  upper  bound  on  the  execution  time  of  the 
se If -t imed  computat ion . 

A  pseudo  systolic  version  of  the  self -timed  computation 
discussed  in  this  section  may  be  obtained  by  replacing  step 
3  in  CYCLE  1  by  the  following 

3)  wait  for  a  synchronization  signal  SYNC  ; 

Ry  -  Ry  +  Ra  *  Rx 


The  purpose  of  the  SYNC  signal  is  the  synchronization 
of  all  the  cells  such  that  the  execution  of  the  network 
alternates  between  two  phases;  a  comnunication  phase  anA  a 
processing  phase.  During  the  conniunication  phase,  the  data 
flows  in  the  network  until  each  cell  is  either  blocked  wait¬ 
ing  for  data  (in  steps  2.1  or  2.2),  or  waiting  for  SYNC  (in 
step  3).  We  assume  that  all  the  cells  are  connected  to  a 
hypothetical  controller  that  issues  the  signal  SYNC  after  it 
detects  the  termination  of  the  communication  phase.  At  that 
instant,  all  the  cells  that  are  waiting  in  step  3  perform 


the  multiplication  simultaneously,  %^ile  the  other  cells 
remain  idle.  This  is  the  processing  phase.  A  communication 
phase  followed  by  a  processing  .  phase  is  called  a  global 
cycle  of  the  network.  In  this  paper,  we  let  N  be  the  total 
number  of  global  cycles  needed  to  terminate  the  computation, 
and  we  denote  by  CP^  and  PP^,  t«l,...,N,  the  computation 
phase  and  the  processing  phase,  respectively,  of  the  global 
cycle  t. 

Given  that  external  data  on  and  Ig  are  avail¬ 

able  «dien  needed,  we  may  define  the  function  a  : 
[-ir^,ir2]x[l,N]  -•  A  such  that  a(k,t)  is  the  element  of  A  that 
is  stored  in  the  register  Ra  of  cell  k  (reeui  from  port  I^) 
during  the  processing  phase  Although,  for  any  specific 

t,  a(k,t)  is  defined  for  all  k  cells,  some  cells  will  be 
idle  during  PP^r  and  hence,  will  not  operate  on  a(k,t).  Let 
be  the  set  of  cells  that  are  not  idle  during  PP^r  and 
define  the  t^^  computation  front  CF^  as  the  set  of  elements 
of  A  that  are  operated  upon  during  that  phase.  More 
specifically, 

CP^  ■  {a(k,t)  :  k  e  M^) 

The  succession  of  computation  fronts  represents  the 
progress  in  the  execution  of  the  pseudo  systolic  computa¬ 
tion.  More  specifically,  given  a  certain  matrix,  we  may 
connect  the  elements  of  each  computation  front  by  a  piece- 
wise  linear  curve  and  thus  obtain  a  visual  picture  that 
describes  the  propagation  of  the  computation.  For  example. 
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we  show  in  Fig  4  the  different  computation  fronts  that 
result  from  the  pseudo  systolic  execution  of  MAT/VEC  on  the 
matrix  A  of  Pig  l.a.  For  clarity,  we  represent  each  non 
zero  elements,  a^  ^  €  S^,  of  A  by  its  stripe  number  k,  auid 
we  represent  the  zero  elements  of  A  by  dots.  Note  that  the 
concept  of  computation  fronts  is  the  same  as  that  suggested 
in  [17}.  However,  by  allowing  irregular  fronts,  we  are  able 
to  model  data  driven  computations  that  depend  on  the  value 
of  the  input  as  well  as  its  availability. 


5 


8 


Computation  fronts  may  be  constructed  systematically  if 
the  conditions  that  governs  the  relation  between  the  ele¬ 
ments  of  the  fronts  are  known.  In  order  to  derive  these 


conditions,  we  first  observe  that  if  a.  _  ,  . .  is  in  CF«., 


then  both  y^  and  x^  should  be  at  cell  k  during  the  pro¬ 
le 


cessing  phase  Similarly,  ^  same 


front  CF^,  then,  both  and  x^  should  be  at  cell  q  dur- 


k 


.‘•‘a.'Tn 


ing  PPf.*  Assuming  that  q  >  k,  then  the  sequential  order  of 

the  X  and  y  data  streams  requires  that  i  <  i  and 

o  (i)  >  a.  (1),  respectively.  In  other  words  the  following 

q  K 

should  be  satisfied 


q£.  data  flow 


l:  If  a 


l,Oq(i) 


^4  »  ^  same  computation  front,  then 

*  r 


1  <  i 


aq(l)  >  a^(i) 


(5. a) 


If  the  queues  on  the  communication  lines  have  infinite 

capacity,  then  any  number  of  data  items  may  be  buffered 

between  cells  k  and  q.  That  is,  there  is  no  upper  limit  on 

the  values  of  (i-i)  or  (a  (i)-o- (i)),  and  hence  (5. a)  is  the 

H  ^ 

only  necessary  condition  for  a.  ^  and  a,  _  to  bo  in 

t,Oj^(i)  i,Og(i) 

the  same  front.  More  descriptively,  (5. a)  means  that  every 
line  segment  in  a  computation  front  should  have  a  slope  s  on 
the  J  axis  (see  Fig  4)  that  satisfies 


that  is 


-«•  <  tan(s)  <  0 


90®  <  s  <  180° 


(5.b) 


In  addition  to,  the  condition  imposed  on  each  individual 
front,  we  have  to  ensure  that  the  fronts  propagate  in  the 


same  direction.  More  specifically. 


condition;  If  a.  „  €  CF^  and 

if^k^^'  t 


*l,o^(l)  *  C't'  «'•" 
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I 


> 


I 

lii 


i  >  i  (6) 

Now,  given  the  zero  pattern  of  any  matrix  A,  we  may 
construct  the  computation  fronts  of  A  as  follows: 


ALGrl  :  /*  Construction  of  the  computation  fronts  */ 


1)  Start  from  the  upper  left  corner  of  A  cund  construct  CF^, 
such  that. 

Cl)  It  includes  as  many  nonzero  elements  of  A  as  possible 
C2)  Condition  (5)  is  satisfied,  and 

C3)  All  elements  of  A  enclosed  between  CF^  euid  the  two 
axes  are  zeroes  (implied  by  condition  (6)). 

2)  For  t-2,3,...,  repeat  until  every  non  zero  element  of  A 
is  in  some  front 

2.1)  Given  construct  CF^  such  that 

Cl)  It  includes  as  many  nonzero  elements  of  A  as  possible 
C2)  Condition  (5)  is  satisfied,  and 
C3)  All  elements  of  A  enclosed  between  CF^_^  and  CF^ 
are  zeroes  (implied  by  condition  (6)). 


By  the  definition  of  the  pseudo  systolic  network,  all 
possible  communications  are  performed  before  the  beginning 

of  a  processing  phase.  Moreover,  every  cell  that  receives 

$ 

all  its  operands  during  the  communication  phase,  executes 
step  3  of  CYCLE  1  upon  reception  of  the  SYNC'  signal.  For 
these  reasons,  we  construct  the  computation  fronts  by 
including  in  each  front  as  many  elements  of  A  as  possible. 


Large  matrices  that  appear  in  practical  applications 
have  usually  non  zero  diaigonal  elements.  For  this  type  of 


••  '  IN 
XA 


matrices,  we  may  establish  the  following  lower  bound: 


Propoaition  If  A  is  an  nxn  matrix  with  non  zero  dieigonal 
elements,  then  at  least  n  computation  fronts  are  required  in 
order  to  cover  all  the  nonzero  elements  of  A. 


Proof :  Each  diagonal  element  should  be  in  some  front,  and 
condition  (5)  does  not  allow  a  single  front  to  include  more 
than  one  diagonal  element.  [] 


In  order  to  establish  an  upper  bound  on  the  number  of 
computation  fronts,  we  study  the  question  of  not  including 
in  each  front  as  many  elements  of  A  as  possible.  More 
specifically,  assume  that  during  the  construction  of  a 
particular  element  a.  _  ...  can  be  included  in  CF.  .  This 

i,Oj^(i)  t 

means  that  at  the  end  of  the  communication  phase  CP^,  cell  k 
is  waiting  in  step  3  of  CYCLE  1.  The  exclusion  of  a .  _  , . . 

from  CF^  may  only  result  if  cell  k  remains  idle  during  the 
processing  phase  PP^,  say  because  SYNC  did  not  reach  that 
cell  due  to  some  transmission  error.  Although  this  error 
does  not  cause  a  failure  of  the  computation,  it  does  alow  it 
do%m  because  cell  k  will  stay  at  step  3  of  CYCLE  1  waiting 
for  the  SYNC  signal  of  the  next  global  cycle. 


Hence,  any  set  of  computation  fronts  that  satisfy  con¬ 
ditions  (5)  and  (6)  corresponds  to  some  execution  of  the 
pseudo  systolic  network  with  unreliable  broadcast  of  SYNC. 
In  order  to  reserve  the  term  computation  fronts  to  the  sets 
that  are  constructed  by  ALGl  and  that  correspond  to  the 


execution  of  a  reliable  pseudo  systolic  network,  we  Intro¬ 
duce  the  following  definition: 


Def in it ion  A:  If  condition  Cl  is  removed  from  ALGl,  then  any 
set  of  fronts  that  result  from  the  construction  is  called  a 
set  of  contours  of  the  matrix  A>[] 

Clearly,  the  number  of  computation  fronts  that  cover  A 
is  less  than  or  equal  to  the  number  of  contours  in  any  set 
of  contours  that  covers  A.  This  may  be  restated  as  follows: 

Proposition  2:  Given  a  matrix  A,  If  we  may  include  all  the 
non  zero  elements  of  A  in  N  contours  that  satisfy  (5)  auid 
(6),  then  the  pseudo  systolic  execution  of  MAT/VEC  ter¬ 
minates  in  at  moat  N  global  cycles. 

In  the  next  section  we  study  a  type  of  matrices  for 
vdiich  the  upper  bound  on  the  number  of  computation  fronts 
provided  by  Proposition  2  coincides  with  the  lower  bound 
established  by  Proposition  1. 


A.  MATRICES  WITH  MQM-QVERLAPP INQ  STRIPES 

Let  and  be  two  full  stripes.  By  Definition  2, 

<  if  Oj^( i)  <a2( i) »  i"l» _ rn.  A  more  restrictive  condi¬ 
tion  may  be  obtained  if  we  require  that,  for  every 

i»2,...,n,  the  intervals  Oj^(i)-Oj^( i-1)  auid  a2( i)-02( 
not  overlap.  That  la,  if 

Oj^(i)  a  o^Ci-l)  i-2,...,n 

The  following  definition  extends  this  simple  condition  to 
stripes  that  are  not  full. 

Def in it ion  A:  The  n  stripes  of  a  matrix  A  are  said  to  be 
non-overlapping,  if  for  any  stripe  -Vj^akaTr^,  and  auiy 
element  ( i ,  i ) )  have 

where  m  is  the  smallest  positive  integer  such  that 
(i-m,Oj^^j^(i-m) )  e  If  the  inequality  in  (7)  is  strict, 

that  is  <  replaces  a,  then  the  stripes  of  A  are  called 
strictly  non  overlapping.  [] 

For  example,  the  matrix  shown  in  Fig  5.b  has  strictly 
non  overlapping  stripes,  while  the  matrix  shown  in  Fig  5. a 
has  overlapping  stripes.  The  positions  where  overlap  occurs 
are  indicated  on  the  figure. 


The  following  lemma  may  be  easily  proved  by  induction 


Lemma  1:  The  it  stripes  of  A  are  non-overlapping  if  emd  only 


if  for  any  k,  and  integers  i  oU\d  m,  such  that 

(i,o^(i)  )€Sj^,  and  ( i-m,Oj^^jjj(i-m) )  equation  (7)  is 

satisfied.  [] 


\ 


Nc 


(a)  A  matrix  with 
overlapping  stripes 


(b)  The  same  matrix  with 
strictly  non  overlapping  stripes 


Fig  5 


The  property  of  strictly  non  overlapping  stripes 
guarantees  that  if  both  a.  ^  ...  and  y.  are  at  cell  k  dur- 

ing  a  specific  global  cycle  t,  then  x^  may  not  be  locked 

behind  another  cell  k+m,  m>0,  and  hence  should  arrive  at 
cell  k  during  the  same  global  cycle  t.  In  other  words,  the 
computation  is  not  delayed  due  to  internal  data  conflicts. 
This  is  formalized  by  the  following  proposition: 


Proposition  2:  Let  A  be  a  matrix  with  non-zero  diagonal  ele¬ 
ments.  If  A  is  striped  such  that  all  its  diagonal  elements 
are  covered  by  one  stripe  and  all  its  stripes  atre  strictly 
non-overlapping,  then  the  pseudo  systolic  computation  of 
MAT/VEC,  with  input  matrix  A,  terminates  in  exactly  n  global 
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cycles. 


Proof:  Let  be  the  stripe  count  of  A,  and  denote 

the  stripes  of  A  by  Sj^,  _ ,0, . . .  ^ere 

auid  Sg  is  the  stripe  that  contains  all  the  diagonals.  Con¬ 
struct  for  each  r~lf...,n  a  contour  C_  that  includes  a_ 

r  i>  f  r 

and  has  a  slope  of  one  stripe  per  row.  That  is  includes 

for  each  k-7ir^, . . .  the  element  (if  any)  of  stripe  k 

%diich  is  at  row  r-k.  More  specifically, 

'=t  ■  <*r-k,o^(r-k)  =  ‘  V  <») 

For  any  specific  a.  ^j*0,  there  exists  a  unique  k  such 
that  a^  ^  e  S^,  that  is  o^(i)*j.  Hence,  there  exists 
exactly  one  contour  that  includes  a^^  j,  namely  In 

other  words,  the  contours  C^,  r-l,...,n  cover  all  the  non 
zero  elements  of  A.  Moreover,  If  a„  ^  .  .  e  C..  and 

v-kfCklv-k)  V 

a„  .  ^  ,  6  C.,  are  in  the  same  stripe  S.  ,  then  v-k>u-k 

u— K,Oj^QU-K)  U  K 

Implies  that  v>u.  That  is  the  condition  of  unidirectional 

propagation,  namely  equ  (6),  is  satisfied. 


It  remains  to  prove  that  the  consistency  of  data  flow 
condition,  namely  equ.  (5),  is  satisfied.  Let  (i,o^(l))  and 
(i,Og(i)  be  any  two  elements  in  C^.  If  q  >  k,  then  from  the 
definition  of  C^,  1-r-k  and  i-r-q,  and  hence  i  <  1.  But  the 
stripes  of  A  eu:e  strictly  non  overlapping,  and  thus,  by 
using  m-q-k  in  (7),  we  obtain 


v^ich  satisfies  (5),  euid  conflates  the  proof  that  all  the 
non  zero  elements  of  A  may  be  covered  by  n  contours  that 
satisfy  (5)  and  (6).  The  result  of  the  proposition,  then, 
follows  directly  from  Propositions  1  aund  2.  [] 

Proposition  3  applies  only  to  matrices  with  strictly 
non  overlapping  stripes.  A  similar  result  may  be  obtained 
for  non  overlapping  stripes,  if  we  weaken  the  condition  on 
computation  fronts  (contours)  such  that  the  line  segments  of 
a  specific  front  (contour)  may  be  vertical.  That  is  <7g(^) 
and  in  (5. a)  may  be  equal,  which  means  that  a  com¬ 

ponent  of  the  X  data  stream  may  be  at  t%ra  different  cells  k 
and  q  during  the  same  processing  phase.  This  may  be 
achieved  if  each  cell  in  MAT/VEC  writes,  immediately,  on  0^ 

the  value  of  x  that  it  reads  from  More  precisely,  the 

operation  of  each  cell  (CYCLE  1)  should  be  modified  to  the 
following: 

CYCLE  2;  /*  Initially,  CX-CY-0  »/ 

1)  Ra  -  [I3]  ;  Rj  -  [I^]  ;  Ri  -  [I5] 

2)  Do  steps  2.1  and  2.2  in  parallel 

2.1)  wait  until  data  is  available  on 

Rx  -  [Ij^l  ;  -  Rx  ;  CX  -  CX  +  1 

If  CX  <  Rj  Then  Go  To  2-1 

Else  JOIN  step  2.2. 

2.2)  wait  until  data  is  available  on  1 2 
Ry  -  [I^]  ;  CY  -  CY  +  1 

If  CY  <  Ri  Then  {  -  Ry  ;  Go  To  2.2  ) 


Else  JOIN  step  2.1. 


3)  Ry  -  Ry  +  Ra  *  Rx 

4)  O,  -  Ry. 


The  proof  of  the  following  proposition  is  very  similar 
to  that  of  Proposition  3. 


Propos it Ion  A:  Let  A  be  a  matrix  with  non-zero  diagonal  ele¬ 
ments  r  that  is  striped  such  that  all  its  diaigonal  elements 
are  covered  by  one  stripe  and  all  its  stripes  are  non¬ 
overlapping.  If  each  cell  executes  CYCLE  2,  then  the  pseudo 
systolic  computation  of  MAT/VEC,  with  input  A,  terminates  in 
n  global  cycles . [ ] 


Given  a  sparse  matrix  A,  the  advantage  of  constructing 
a  non-overlapping  stripe  structure  for  A  is  clear.  However r 
assuming  that  ir  is  minimum  number  of  stripes  that  may  cover 
A,  and  ir^  is  the  number  of  non-overlapping  stripes  that  may 
cover  A,  thenr  usually,  ir^  >  v.  Hence,  a  trauie  off  should 
be  considered  between  1)  using  a  network  with  cells  that 
terminates  execution  in  n  global  cycles,  or  2)  using  a  net¬ 
work  with  ir  cells  which  requires  an  execution  time  larger 
than  n  global  cycles. 


The  cost  of  the  determination  of  a  non-overlapping 
stripe  structure  for  general  sparse  matrices  is  usually 
high,  tfore  specifically,  the  modification  of  the  algorithm 
given  in  the  appendix  such  that  to  include  condition  (7) 
requires  some  form  of  back-t racking,  which  is  costly.  How- 


everr  for  some  type  of  matricea,  non-overlapping  stripes  may 
be  obtained  for  very  low  additional  costs.  For  example,  for 
the  elates  of  finite  element  stiffness  matrices  considered  In 
[12],  non-overlapping  stripes  may  be  obtained  by  renumbering 
the  nodes  of  the  grid  from  vdilch  the  matrix  Is  generated. 

Besides  the  number  of  global  cycles,  the  execution  time 
of  M21T/VEC,  Is  determined  by  the  time  for  the  completion  of 
each  global  cycle,  vdilch  depends  on  the  communication 
activities  that  takes  place  during  the  communication  phaise 
of  the  cycle.  We  consider  these  activities  In  some  details. 


By  the  definition  of  peeudo  eyetolic  netiirorlce,  no  com¬ 
munication  takes  place  during  the  processing  phases  of  glo¬ 


bal  cycles.  Hence,  during  a  specific  processing  phase 
data  assumes  a  static  profile.  In  other  «#ords,  a  function 
may  be  defined  for  each  data  stream  to  specify  the  data 
items  at  each  computational  cell. 

For  example,  consider  the  x  data  stream  in  MAT/VEC  and 
assume  that  the  register  Rx  is  set  initially  to  an  arbitraury 
value,  say  Xq  (the  value  of  Xq  is  irrelevant  to  the  computa¬ 
tion).  The  x-stream  profile  at  the  processing  phase  PP^  may 
then  be  specified  by  a  function  xP^:  [-7r^,7r2]-*[0,n]  such  that 
xP^(k)  *  j,  «^ere  x^  is  the  value  of  the  register  Rx  in  cell 
k  during  PP^.  The  y-stream  profile  at  PP^  may  be  specified 
by  a  similar  function  yP^:  t-irj^,Tr2)-»[0,n] .  Note  that  the 
registers  Rx  and  Ry  contain  always  some  values,  and  hence 
xP^  and  yP^  are  defined  for  every  cell  k. 

If  kcH^,  that  is  cell  k  is  not  idle  during  then 

xP^(k)  and  yP^(k)  may  be  determined  from  the  computation 
front  CP^.  More  specifically,  if  kcN^,  then  g  (£)  ^ 

for  some  i,  and  hence,  y^  and  x^  are  at  cell  k  during 


PP^.  That  is 


c  ca'. 


«>  yP*.(k)-t  and  xP.(k)-o.(t)  (9) 


We  call  the  values  of  xP^(k)  and  yP^(k),  for  keH^,  the  knots 
of  the  profiles.  On  the  other  hand,  if  k/Of^,  then  the 


m 
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values  of  xP^(lc)  and  yP^()c)  may  not  be  determined  from  a 
simple  formula.  However,  from  the  specification  of  the 
operation  of  each  cell  (CYCLE  1),  it  is  clear  that  the  fol¬ 
lowing  properties  should  be  satisfied  for  . . .  rir^-l  and 


any  t: 


1 . 


<  xP^(k+l) 
xP^(k+l) 


> 

?t(J^)  {  ^ 


>  yP^Ck+l) 
a  yP^(k+l) 


if  k-i-1  e 


if  k+1  ^ 


if  k  e 


if  k  ^  M. 


(10. a) 


(10. b) 


Note  that  it  is  possible  that  xP^(k) -xP^(k-fl)  if 
k-t-l/M^.  More  specifically,  %dien  cell  k+1  is  waiting  for  a 
new  input,  its  register  Rx  keeps  the  old  value  of  x  that  haus 
been  written  on  during  CP^.  If  this  value  is  also  re^ 
by  cell  k  during  CP^,  then  the  registers  Rx  in  both  cells  k 
and  k+1  contain  the  same  value  during 

Also,  the  elements  of  each  stream  arrive  at  a  specific 
cell  in  order.  That  is,  the  following  is  satisfied  for 


k— . . .  ,^2 


■  \ 

?t(lc)  ^ 


xPt^l(k) 


*  yp/ 


if  k  e 


if  k  ^ 


if  k  e 


(11. a) 


(11. b) 


I  a  y^t+l^*'^  if  k  V-*.-/ 
Again,  the  equalities  may  hold  because  a  register  does 
retain  its  value  if  it  is  not  overwritten  by  a  new  one. 

Equations  (10)  and  (11)  force  on  data  profiles  the  same 


conditions  that  equations  (5)  and  (6)  force  on  computation 
fronts.  In  fact,  it  is  straight  forward  to  derive  (5)  wd 
(6)  from  (10)  and  (II).  Moreover,  given  some  computation 
fronts  which  satisfy  (5)  and  (6),  that  is  which  simulate  an 
execution  of  M\T/VEC,  there  should  exist  some  functions  that 
satisfy  (9),  (10)  and  (11)  and  correspond  to  the  data  pro¬ 
files  during  execution.  However,  the  mathematical  construc¬ 
tion  of  these  functions  is  complex  and  involves  the  solution 
of  a  set  of  simultaneous  inequalities,  namely  (10)  and  (11). 
For  this  reason,  we  seek  further  restrictions  on  the  confu¬ 
tation  fronts  and  data  profiles,  by  limiting  the  communica¬ 
tion  capabilities  of  the  network. 

^.1.  gQinmunicattQn  links  with  limited  buffer  capacity 

A  communication  link  directed  from  a  cell  k  to  cell  k+l 
may  be  regarded  as  a  queue.  Only  cell  k  may  append  to  this 
queue  and  only  cell  k-t-l  may  access  (and  delete)  its  front 
element.  So  far,  we  have  assumed  that  the  communication 
queues  (buffers)  in  MAT/VEC  have  infinite  capacity,  that  is 
d  -d  •«,  where  d  and  d  are  the  capacities  of  individual 

^  T  ^  T 

queues  on  the  x-stream  and  y-streeun,  respectively.  With 
this  assumption,  we  were  able  to  derive  the  conditions  (5) 
and  (6)  which  enable  us  to  construct  the  computation  front 
for  any  given  matrix.  Clearly,  any  limitation  on  d  or  d 

*  4 

represents  some  additional  restrictions  that  should  be  taken 
into  account  during  that  construction. 


More  specifically,  and  without  going  into  the 


implementation  details  of  the  coimnunication  protocols,  if 
and  x^  are  at  cells  k  and  k-fl,  respectively,  during  the  pro¬ 
cessing  phase  PP^r  then  v-u  elements  of  the  x-stream  should 
be  buffered  between  the  two  cells,  which  requires  a  queue 
capacity  of,  at  least,  that  size.  That  is 

xP^(k+l)  -  xP^(k)  a  (12. a) 

Similarly,  for  the  y-stream 

I 

yP^dc)  -  yPt(k+l)  a  dy  (12.b) 

Equations  (12)  may  be  translated  into  restrictions  on 
computation  fronts.  More  precisely,  we  may  derive  the  fol¬ 
lowing  from  (9)  auid  (12): 

BttCCflr  capacity  condition:  if  a^^^  and  a^^^^  q  >  k, 

q  k 

are  in  the  same  confutation  front  CF^r  then 

(13. a) 

and  q  K  Jc 

i  -  i  a  (q-k)  dy  (13. b) 

The  buffer  capacity  condition  is  weaker  than  conditions 
(12)  because  it  restricts  the  collective  capacity  of  the 
links  between  cells  q  and  k,  rather  that  the  capacities  of 
the  individual  links.  In  order  to  clarify  this  point  let 
d^«3,  q-k+2,  i-t+2,  and  Oj^^2(^~2)-aj^(i)-6.  Clearly,  with 

these  values,  (13. a)  is  satisfied,  and  by  definition, 
xP^(k-i-2)-Oj^^2<i’^)  xPt(l«)"0]^(i)  •  Although  both  (12. a) 

and  (13. a)  specify  that  at  most  2d^-6  data  elements  may  be 


buffered  between  cells  k  auid  lc-i-2,  only  (12. a)  specifies  that 
three  of  these  elements  should  be  buffered  between  cells  k 
and  k-hl  and  the  other  three  between  cells  k-t-1  and  k-t'2.  More 
specifically,  (13. a)  does  not  put  any  restriction  on 
xP^(k+l) ,  while  (12. a)  requires  that  xP^(k+l) -Oj^( i)+3 .  Now, 
let  a,  ,  /•  IX  be  in  the  next  computation  front 

i-l,Oj^^j^(i-l)  ^  t+l 

with  Oj^^j^(i-l)«Oj^(i)+2,  that  is  xP^^j^(k+l) -aj^(  i)+2.  It  is 
easy  to  see  that  the  above  data  is  inconsistent  because 
xP^(k+l)  >  xP^^j^(k-H) ,  which  violates  (11. a).  However,  by 
allowing  arbitrary  values  to  xP^(k-fl) ,  the  buffer  capacity 
condition  (13. a)  does  not  detect  this  inconsistency. 

If  conditions  (13)  are  added  to  ALGl  of  Section  3,  then 
computation  fronts  that  satisfy  (5),  (6)  and  (13)  may  be 
constructed  for  any  given  matrix.  However,  because  (13)  is 
weaker  than  (12),  the  constructed  fronts  repr4»sent  the  exe¬ 
cution  of  MAT/VEC  only  if  it  is  possible  to  find  a 
corresponding  data  profile  (with  knots  specified  by  (9)) 
that  satisfy  (10),  (11)  and  (12).  Although  this  technique 
of  constructing  the  computation  fronts  auid  then  checking 
that  they  represent  the  actual  execution  of  the  network  may, 
in  general,  fall,  it  can  be  used  to  show  that  the  results  of 
Propositions  2  and  3  are  independent  of  the  size  of  the 
queues  on  the  y-stream  links. 

Mbre  specifically,  consider  the  minimum  value  of  d^, 

namely  d  >1,  wd  keep  d  It  is  easy  to  check  that  the 

£  ^ 

contours  C^,  r-l,...,n,  used  in  the  proof  of  Proposition  3 
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I  ^ 


i 


f  f.: 


l: 


A- 

i 


do  satisfy  (13.b)  with  Moreover,  let 


yP^(k)  »  r-k  r-l,...,n,  k— (14) 


The  knots  of  this  profile  corresponds  to  (as  specified  by 
(9)).  Also,  (14)  satisfies  (10. b),  (11. b)  and  (12. b)  with 


dy«l.  Hence,  the  n  contours  given  by  (8)  correspond  to 


some  execution  of  MAT/VEC  with 


In  other  words,  if  <Sy"l  suid  d^-*  in  MAT/VEC,  then  the 


execution  of  the  network  for  any  nxn  matrix  with  non  zero 
diagonal  elements  does  terminate  in  n  global  cycles. 
Although  this  is  a  good  result,  it  is  preferable  to  replace 


the  condition  d„“"  by  one  of  the  form  d  ^  d_.  .  In  order 
X  X  min 


to  derive  the  lower  bound  d_.  ,  we  should  construct  an  x- 

min 


stream  profile  that  satisfies  (9),  (10. a)  and  (11. a),  and 
then  from  (12. a)  get 


dmin"  ’  E“l»..rn,  k—Vj^+1, . .  ,^2)  (15) 

z ,  k 


For  general  striped  matrices,  the  construction  of  such 
x-stream  profile  seems  difficult.  However,  for  certain 
types  of  stripe  matrices  the  construction  is  straight  for¬ 
ward  (see  [12]  for  examples). 


In  order  to  give  further  meaning  to  the  bound  (15),  it 
is  useful  to  consider  matrices  with  full,  non-overlapping, 
stripes.  In  this  case,  it  is  easy  to  see  that  the  contours 
given  by  (8)  are  the  actual  computation  fronts,  that  is 


t-k,Oj^(t-k)  ‘  > 


(16) 


Prom  (9)r  the  correaponding  x-atream  profile  la  given  by 

xPt(lc)  -  Oj^(t-k)  (17) 

%diich  by  the  very  propertied  of  the  atripea  aatiafy  (10. a) 
and  (ll.b).  Now,  from  (15), 

^min  - 

<  max{Oj^^j^(t-k)  -  Oj^(t-k)  } 

k,  t 

■  the  maximum  aeparation  between  the 
atripea  of  the  matrix. 

In  other  worda,  the  aeparation  between  the  atripea  deter- 
minea  the  minimum  aize  of  the  queuea  needed  on  the  x-atream 
comnunication  linka. 

Finally,  we  note  that,  with  finite  queuea  capacity,  the 
communication  protocol  ahould  not  allow  a  cell  to  %irrite  on  a 
queue  that  ia  full.  More  apecif Ically,  with  d^*!,  the 
abatement  02*-Ry  in  CYCLE  1,  ahould  be  interpreted  aa  "wait 
until  the  queue  la  not  full,  then  write  the  content  of  Ry”. 

^.2.  gQiwnun  teat  ton  ttma  la  Mftl/VEC 

Let  be  the  time  required  by  a  cell  in  MRT/VEC  to 
complete  a  floating  point  operation  (atep  3  of  CYCLE  1),  and 
let  be  the  time  required  to  move  a  data  item  from  the 
input  port  of  aome  cell  to  that  of  the  next  cell.  For  exam¬ 
ple,  a  data  item,  aay  x^,  may  be  moved  from  port  of  cell 
k  to  port  I.,  of  cell  k-1  in  time  r  .  Thia  includea  the  time 
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for  the  execution  of  step  2.1  of  CYCLE  1  and  the  associated 
protocols,  as  well  as  the  time  required  for  the  signals  to 
travel  on  the  communication  lines. 

In  order  to  estimate  the  execution  time  of  any  specific 
global  cycle,  we  assume  that  cell  k  executes  steps  2.1 
times  euid  step  2.2  times  during  the  communication 

phase  CP^  of  the  t^^  global  cycle.  Clearly,  ouid 

may  be  estimated  from  the  data  profiles  as  follows: 

f.  (k)  -  xP^(k)  -  xP._,(k)  (18. a) 

vlW  -  -  yPt-1^^)  (18. b) 

However,  each  cell  in  MAT/VEC  has  to  wait  until  all  the 
other  cells  complete  their  communication  activities.  Hence, 
the  duration  of  the  communication  phase  CP^  is  determined  by 
the  maximum  of  f^(lc)  and  for  all  k.  That  is  by 

-  max  I  k— TTj^, . . .  ,7^2)  (19. a) 

and 

Tl^  -  max  {i7^(k)  |  k— tTj^,  . . .  ,^2)  (19. b) 

Prom  CYCLE  1,  it  is  now  easy  to  see  that  the  time 

required  for  the  completion  of  global  cycle  t  is 

■^t  "  ""m  ™«f«t  '  ’'t^  "“c 

For  ‘Jy"!  y-stream  profile  is  given  by 

(14),  from  %^ich  we  find  that  and  hence  the  total  exe¬ 

cution  time  of  MAT/VEC  is 


H 


N 

'  E 


t-l 


Nr  +  T 
m  c 


N 

E 


(20) 


t-l 


v. 


/  V 


36 


The  values  of  t^l, . . . ,N,  depend  on  the  specific 
stripe  structure  of  the  input  matrix.  However,  without 
knowing  the  specific  stripe  structure  of  the  matrix,  we  may 
only  bound  the  execution  time  of  MAT/VEC  by 


‘max  ’  ”“<‘t  '  . *'> 

As  we  did  in  the  last  section,  we  may  give  further 
meouiing  to  considering  matrices  with  full,  non- 

IIIaX 

overlapping,  stripes.  For  this  type  of  matrices,  the  x- 
stream  profile  is  given  by  (17),  from  which  we  obtain 


€ 


max 


=  max  {o^(t-k)-Oj^(t-k-l) }  -  max{o^(  i)-o^(  i-1) } 
k,  t  k,  i 

the  maximum  slope  of  any  stripe  In  the  matrix. 


Note  that  equation  (20)  estimates  the  execution  time  of 
MAT/VEC  assuming  the  hypothetical  pseudo  systolic  synchroni¬ 
zation.  In  actual  execution,  however,  the  synchronization 
of  MAT/VEC  should  be  merely  data  driven  (no  wait  in  step  3 
of  CYCLE  1) ,  and  hence  faster  than  the  pseudo  systolic  exe¬ 
cution.  In  other  words,  the  value  of  T  given  by  (20)  is  an 
upper  bound  for  the  actual  execution  time  of  MAT/VEC.  This 
upper  bound  is  used  in  [12]  to  study  the  performance  of 
MAT/VEC  for  finite  element  matrices. 


£l.  The. 


of.  sparse  linear  ayatema 


In  this  section,  we  consider  one  of  the  most  efficient 
iterative  techniques  for  the  solution  of  linear  systems  of 
the  form  Ax»z  Neunely  the  preconditioned  conjugate  gradient 
method.  This  method  applies  conjugate  gradient  iterations 
to  the  system  M  x  =  M  ^z,  where  the  preconditioner  matrix 
M  is  a  suitable  approximation  of  A  In  many  precondition¬ 
ing  techniques,  such  as  incomplete  factorizations  [11]  and 
SSOR  [1],  the  matrix  M  may  be  expressed  as  the  product  of  a 
unit  lower  triangular  matrix  L,  a  diagonal  matrix  D,  and  a 
unit  upper  triauigular  matrix  U.  That  is  M  -  LDU.  The  pro¬ 
perty  that  makes  these  preconditioners  attractive  is  that 
the  matrices  L  and  U  have  the  same  zero  structures  as  the 
lower  and  upper  parts  of  the  matrix  A,  respectively. 

The  bulk  of  the  work  in  each  iteration  of  the  precondi¬ 
tioned  conjugate  gradient  method  is  the  multiplication  of 
the  matrix  A  by  a  vector,  and  the  solution  of  two  trieuigular 
linear  systems  of  the  forms  Ly-u  and  Uz-v. 

It  was  shown  in  the  previous  sections  that,  if  a  suit¬ 
able  stripe  structure  is  found  for  the  matrix  A,  then  the 
multiplication  of  A  by  any  vector  may  be  efficiently  exe¬ 
cuted  in  parallel  on  the  linear  network  MAT/VEC.  Here,  we 
show  that,  with  very  simple  modification,  MAT/VEC  may  also 
be  used  for  the  solution  of  euiy  unit  lower  triangular  system 

Ly-u.  The  unit  upper  triangular  system  Uz-v,  may  also  be 

T 

viewed  as  a  unit  lower  triangular  system  U  z'=>v',  where  z* 


euld  v*  are  obtained  by  reversing  the  elements  of  z  and  v, 
respectively.  That  Is,  If  the  order  of  U  Is  n,  then 


z  *  .  “Z  .  ,  and  z 
1  n-i+l 


l“^n-l+l* 


Assume,  as  before,  that  A  has  stripes 

S  <...<S  ,  where  S_  Is  a  full  stripe  that  contains  all 

-TTi  0 

the  diagonal  elements.  Hence,  L  has  v^+1  stripes  that  coin¬ 
cide  with  those  In  the  lower  triangular  part  of  A.  Namely, 
S_^^, ...,Sq. 

Let  MAT/VEC,  with  d  *1  and  be  applied  to  the 

y  X  min 

multiplication  of  L  by  any  vector  (Fig  6).  In  this  case, 
the  Inputs  on  ports  and  Ig  of  cell  0  are  not  needed 
because  all  the  elements  in  the  full  stripe 

SQ-»{(i,  1)  1  i»l, ,n)  are  supplied,  in  order,  to  port  Ig. 

For  the  same  reason,  the  counters  CX  and  CY  of  CYCLE  1  are 
not  needed  in  cell  0.  In  other  words,  the  cycle  of  cell  0 
may  be  described  by 


Figure  6  -  A  modified  version  of  MAT/VEC 

CYCLE  3: 


1)  Ra  -  [I3] 


2)  wait  until  data  is  available  on  euid  ; 


Rx  -  [I^]  ;  Ry  -  [I^] 

3)  Ry  »  Ry  +  Ra  *  Rx 

4)  Oj^  Rx  ;  •-  Ry- 

The  same  network  of  Fig  6  may  be  used  for  the  solution 
of  the  triangular  system  Ly>u  if  the  elements  of  the  vector 
u,  instead  of  the  elements  of  Sq,  are  supplied  to  of  cell 
0,  and  the  operation  of  this  particular  cell  is  described  by 
the  following  cycle  (instead  of  CYCLE  3). 

CYCLE  4: 

1)  Ra  =  [I3] 

2)  wait  until  data  is  available  on 

Ry  -  [4]  f 

3)  Ry  -  -Ry  /  Ra 

4)  •-  Ry  ;  0^  ••  Ry. 

We  call  the  resulting  network  TRIANG.  Note  that  the  input 
port  of  cell  0  in  TRIANG  is  not  used,  and  hence  no  data 

need  to  be  supplied  on  it.  Also,  the  elements  of  the  result 
vector  y  are  produced  on  the  output  port  O^. 

The  computation  fronts  for  the  pseudo  systolic  execu¬ 
tion  of  TRIANG  may  be  obtained  by  applying  our  earlier  rules 
* 

to  a  matrix  L  which  is  obtained  by  replacing  the  diagonal 

elements  in  L  by  the  elements  of  the  vector  u.  But  the  zero 
* 

structure  of  L  is  identical  to  that  of  L,  which,  in  turns, 
is  identical  to  the  lower  triauigle  of  A.  Hence,  the  stripe 


structure  of  A  may  be  used  to  estimate  the  execution  time  of 
TRIANG.  More  specifically,  if  the  stripes  of  A  are  strictly 
non-overlapping,  then  the  pseudo  systolic  execution  of  TRI¬ 
ANG  terminates  in  n  global  cycles.  The  capacities  of  the 
buffers,  as  well  as  the  communication  time,  in  TRIANG  may  be 
also  estimated  from  the  stripe  structure  of  A. 

Finally,  we  note  that  the  networks  MAT/VEC  and  TRIANG 
may  be  used  as  two  high  speed  peripheral  devices  for  a  host 
computer  that  executes  the  preconditioned  conjugate  gradient 
iterations.  In  fact,  only  MAT/VEC  is  needed  if  a  mechauiism 
is  availaible  for  switching  the  function  of  cell  0  between 
CYCLE  3  and  CYCLE  4. 


2.  CQMCLUSIQN 


We  Introduced  the  concept  of  striping  a  sparse  matrix, 
which  Is,  namely,  the  Inclusion  of  the  non-zero  elements  of 
the  matrix  In  a  structure  which  Is  regular  enough  to  allow 
for  efficient  parallel  manipulation.  Although  the  concept 
Is  general,  we  only  discussed  Its  application  to  the  design 
of  regular  VLSI  networks  for  speucse  matrices. 

The  operation  of  each  cell  In  the  networks  presented  In 
this  paper  are  data  dependent,  as  well  as  data-driven.  This 
makes  the  application  of  formal  euialysls  models  (e.g. 
[5,14])  extremely  difficult.  For  this  reason,  the  addi¬ 
tional  simplification  of  pseudo  systolic  synchronization  was 
assumed,  which  allowed  for  the  establishment  of  upper  bounds 
on  the  performance  of  the  networks.  The  actual  performance 
of  the  data-driven  networks  may  only  be  estimated  by  a 
detailed  simulation  of  the  computations. 

It  is  proved  that  for  an  nxn  matrix  with  n  non¬ 
overlapping  stripes,  (Tr<<n)  and  minimum  separation 
between  stripes,  the  multiplication  of  the  matrix  by  a  vec¬ 
tor  may  be  performed  In  n  global  cycles,  using  a  linear  net¬ 
work  of  n  cells  connected  by  links  that  can  buffer  data 

Items.  The  same  result  also  applies  to  the  solution  of  tri¬ 
angular  linear  systems. 

The  task  of  finding  a  stripe  structure  for  a  sparse 
matrix  may  be  accomplished  in  many  different  ways.  Including 
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the  algorithm  given  in  the  appendix.  However,  for  some 
types  of  matrices,  a  stripe  structure  may  be  naturally 
determined  from  the  underlying  problems.  For  example, 
stiffness  matrices  arising  in  finite  element  auialysis  are 
usually  generated  from  finite  grids,  emd  the  stripe  struc¬ 
ture  of  a  stiffness  matrix  is  directly  specified  by  the 
neighboring  relation  between  the  nodes  of  the  corresponding 
grid.  For  more  details  we  refer  to  [12]. 
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An  algorithm  (in  the  form  of  a  Fortran  subroutine)  is 
presented  for  the  determination  of  a  stripe  structure  for 
euiy  given  matrix  A.  The  structure  is  specified  by  a  matrix 
PA  in  which  each  column  k  specifies  the  positions  of  the 
elements  in  stripe  (see  equation  l.b).  More  precisely, 
if  PA(i,k)>j,  then  a.  ,  auid  if  PA(i,k)-0,  then,  there  is 

no  elements  in  row  i  that  belong  to  S^.  Note  that,  for  sim¬ 
plicity,  we  replaced  -b^^  in  (l.b)  by  0. 

The  linear  array  "Last”  should  be  set,  in  the  calling 
program,  such  that  Last(i),  l^i^n,  contains  the  number  of 
non-zero  elements  in  row  i  of  the  matrix  A.  Also,  the 
integer  "nstrip"  should  be  set  to  the  maximum  of  Last(i), 
i*l,...,n.  The  column  numbers  of  the  ”Last(i)"  elements  in 
row  i  of  A  are  initially  stored  in  the  first  ”Last(i)”  loca¬ 
tions  of  row  1  of  PA  (see  Fig  7).  The  subroutine,  then, 
transforms  this  input  form  of  PA  into  the  one  that  specifies 
a  stripe  structure  of  A.  The  number  of  stripes  is  also 
returned  In  the  variable  "nstrip”.  Following  is  the  algo¬ 
rithm: 

subroutine  stripes (PA, last,n,nstr ip) 
integer  PA(n,l),  last(n) 
c 

j  -  0 

10  j  -  j  +  1 

do  1000  i«2,n 
c 

c  Find  the  previous  element  in  the  current  stripe 

200  m  -  0 


m  -  m  +  1 

if(i-ni  -LT.  0)  go  to  1000 
if (PA( i-m, j )  .EQ.  0)  go  to  300 

**  If  stripe  la  not  strictly  Increasing,  then  shift 
**  row  i-m  by  one  position  starting  at  column  j  ** 

if(PA(l-m,j)  .GE.  P(i,3))  then 

call  shlft(PA, last,n,nstr Ip, l-m,3) 
go  to  300 
end  If 


lf(j  .LT.  nstrlp)  go  to  10 


li 


45 


I 


h-^ 


V! 


Finally,  we  note  that  it  la  possible  to  modify  the 
above  algorithm  such  that  the  resulting  stripes  are  non¬ 
overlapping.  However,  it  seems  impossible  to  enforce  the 
condition  of  strictly  non-overlapping  stripes.  In  fact,  the 
existence  of  a  stripe  structure  with  strictly  non¬ 
overlapping  stripes  is  not  always  guarau\teed.  The  matrix  of 
Fig  7(a)  is  eui  example  for  which  such  structure  does  not 
exist. 
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